MCN: Modulated Convolutional Network
FIGURE 3.3
MCN convolution (MCconv) with multiple feature maps. There are 10 feature maps in the input and 20 in the output. The reconstructed filters are divided into 20 groups, and each group contains 10 reconstructed filters, corresponding to the numbers of output and input feature maps, respectively.
map, h = 1, i = 1, ..., 10, g = 1, ..., 10, and for the second output feature map, h = 2, i =
11, ..., 20, g = 1, ..., 10.
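The index ranges above can be captured by a small bookkeeping function. This is a hypothetical sketch of ours (the function name and layout are assumptions, not the authors' code): output feature map h is produced from the reconstructed filters i = (h − 1) · 10 + g, one for each input feature map g = 1, ..., 10.

```python
# Hypothetical index bookkeeping for the MCconv grouping described above:
# output feature map h uses reconstructed filters i = (h - 1) * 10 + g,
# one per input feature map g = 1, ..., 10.
def filter_index(h, g, num_inputs=10):
    return (h - 1) * num_inputs + g

print(filter_index(1, 1))   # 1  (first filter of output map 1)
print(filter_index(1, 10))  # 10 (last filter of output map 1)
print(filter_index(2, 1))   # 11 (first filter of output map 2)
```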
When the first convolutional layer is considered, the input size of the network is 32 × 32.² First, each image channel is copied K = 4 times, resulting in a new input of size 4 × 32 × 32 to the entire network.
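The channel replication above can be sketched in a few lines. This is a minimal illustration (variable names are ours), assuming a single gray-level channel of size 32 × 32 that is stacked K = 4 times:

```python
import numpy as np

# Sketch of the input expansion described in the text: one gray-level
# 32x32 channel is copied K = 4 times, producing the 4x32x32 input
# fed to the first MCconv layer.
K = 4
channel = np.random.rand(32, 32)            # single gray-level channel
expanded = np.stack([channel] * K, axis=0)  # shape (4, 32, 32)
print(expanded.shape)  # (4, 32, 32)
```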
It should be noted that the number of input and output channels in every feature map
is the same, so MCNs can be easily implemented by simply replicating the same MCconv
module at each layer.
3.4.2 Loss Function of MCNs
To constrain CNNs to have binarized weights, we introduce a new loss function in MCNs. Two aspects are considered: the unbinarized convolutional filters are reconstructed from the binarized ones, and intra-class compactness is enforced on the output features. We further introduce the variables used in this section: C_i^l are the unbinarized filters of the l-th convolutional layer, l ∈ {1, ..., N}; Ĉ_i^l denote the binarized filters corresponding to C_i^l; M^l denotes the modulation filter (M-Filter) shared by all C_i^l in the l-th convolutional layer, and M_j^l represents the j-th plane of M^l; ∘ is a new plane-based operation (Eq. 3.12) defined in the next section. We then have the first part of the loss function for minimization:
L_M = (θ/2) ∑_{i,l} ‖C_i^l − Ĉ_i^l ∘ M^l‖² + (λ/2) ∑_m ‖f_m(Ĉ, M⃗) − f̄(Ĉ, M⃗)‖²,    (3.18)
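A numpy sketch of Eq. (3.18) may help make the two terms concrete. This is an illustration under our own assumptions, not the authors' implementation: we approximate the plane-based operation Ĉ ∘ M by an element-wise product of each binarized filter with the shared M-Filter, and we take f̄ to be the per-class mean of the output features, so the second term penalizes intra-class spread.

```python
import numpy as np

# Illustrative sketch of Eq. (3.18). The concrete form of the plane-based
# operation "o" and of f_bar are assumptions made for this example.
def mcn_loss(C, C_hat, M, feats, labels, theta=1.0, lam=1.0):
    # C, C_hat: (num_filters, K, k, k) unbinarized / binarized filters
    # M:        (K, k, k) M-Filter shared by all filters in the layer
    # feats:    (num_samples, d) output features f_m
    # labels:   (num_samples,) class index of each sample
    recon = C_hat * M[None, ...]                  # C_hat o M (plane-wise)
    loss_filters = 0.5 * theta * np.sum((C - recon) ** 2)

    loss_features = 0.0
    for c in np.unique(labels):
        cls_feats = feats[labels == c]
        center = cls_feats.mean(axis=0)           # f_bar for class c
        loss_features += 0.5 * lam * np.sum((cls_feats - center) ** 2)
    return loss_filters + loss_features
```

When the unbinarized filters equal the reconstructed ones and each sample's features sit at its class center, both terms vanish and the loss is zero; any deviation in either term increases it quadratically.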
² We use only one channel of the gray-level images (3 × 32 × 32).